Economic Benefits of Getting a Degree


Data Bootcamp Final Project
Yuxin Xia (yx784@nyu.edu)
Dec.15th 2016

Overview:
There are many reasons for people going to colleges, such as obtaining knowledge, meeting new people and expanding social networks. Within these benefits, getting a diploma is an attractive one. In most cases, people believe that diplomas will benefit their career in the future and help them earn more.
In this project, I will try to quantity the economic benefits of getting a degree(including bachelor, master and doctor degrees) with two main data availabe online: percentage of workers with different degrees and the weekly salary of each occupation. Compared with the tuition people paid to get a degree, this project can help people see if their college education is worthwhile from an economic perspective of view.

Related Website:


In [1]:
import sys                             # system module
import pandas as pd                    # data package
import matplotlib as mpl               # graphics package
from plotly.offline import iplot, iplot_mpl  # plotting functions
import plotly.graph_objs as go               # ditto
import plotly 
plotly.offline.init_notebook_mode() # run at the start of every ipython notebook


General Picture

First, I am going to show a general picture of the annual wage for people with different education levels. The table I got contains the median annual wage for people of different degrees.

Data Sourse:
title of the file: Employment, wages, and projected change in employment by typical entry-level education (Employment in thousands)
go to page and click on “Other available formats: (XLSX)” on the top to get the file downloaded to the laptop.
I uploaded the file to Github.


In [2]:
#read the file from github and only take the sheet with name "summary"
data_general = pd.read_excel("https://github.com/yuxinXia96/final-project/raw/master/education.xlsx",\
                             sheetname = 'Summary')

In [3]:
#drop rows with missing data
data_general= data_general.dropna()
#keep columns: oppupation and annual wage, drop cloumns: number of worker, percentage distribution, employment change
data_general = data_general.drop(['Unnamed: 1','Unnamed: 2','Unnamed: 3'],axis = 1)
#rename the columns
data_general = data_general.rename(columns = {'Employment, wages, and projected change in employment by typical \
entry-level education (Employment in thousands)':'Education','Unnamed: 4':'Median annual wage(2015)'})

In [4]:
#get the medium annual wage within all the workers, add it as a column
data_general['Mean'] = data_general['Median annual wage(2015)'][2]
#drop the row with information about the total population
data_general = data_general.drop([2])

In [5]:
#the table looks like:
data_general


Out[5]:
Education Median annual wage(2015) Mean
3 Doctoral or professional degree 100490 36200
4 Master's degree 66420 36200
5 Bachelor's degree 70400 36200
6 Associate's degree 50230 36200
7 Postsecondary nondegree award 35660 36200
8 Some college, no degree 33870 36200
9 High school diploma or equivalent 36210 36200
10 No formal educational credential 21420 36200

In [6]:
#plot
general_benefit = dict(type="scatter",                                      
           name="median annual wage for different degrees",                                      
           x=data_general['Education'],                             
           y=data_general["Median annual wage(2015)"],                             
           marker={"color": "blue"},
           mode = 'markers-lines',
            )
general_mean = dict(type="scatter",                                      
           name="Average annual wage",                                      
           x=data_general['Education'],                             
           y=data_general["Mean"],                             
           marker={"color": "purple"},
           mode = 'lines',
            )
layout = dict(                      
              yaxis={"title": "annual wage"},                   
              title="Annual Wage for People of Different Degrees",         
    )
iplot(go.Figure(data=[general_benefit,general_mean],layout=layout))


Although there are some exceptions (exceptions may raise from decresed working experience), generally sepaking, higher education leads to higher annual wage. However, the benefits of getting a degree differ from occupation to occupation. I am going to show the economic benefits of a degree to some detailed occupations.

Economic Benefits Ordered by Detialed Occpations

It’s clear that the higher education a person has, the easier for him/her to find a particular job. I found a file with the percentage of educational attainment for workers ordered by detailed occupations. To quantify the easiness of finding a job, for each education level, I used the cumulative percentage of workers with that occupation.
For each occupation, I multiplied the percentage of the workers with the weekly salary of that occupation. Then I multiplied the result with 52(52 weeks in year)to get the annual income. The changes of annual income with different education levels can show the economic benefits brought by education.

Data Sources:


In [7]:
#read the weekly earning file from the Internet. Notice that the type of the file is list.
wage = pd.read_html("http://www.bls.gov/cps/cpsaat39.htm") 
#data_wage[0] and data_wage[1] have similar data, so only keep one
wage = wage[1] 
#convert list into dataframe
data_wage = pd.DataFrame(wage,columns = ['Occupation','2015','Total'])
#see the size of the file
data_wage.shape


Out[7]:
(570, 3)

In [8]:
#some adjustment to get the data I need -- occupation,number of workers, weekly salary

#total wage: remove $ sign
data_wage = data_wage.replace(to_replace='$809',value=809) 
#replace '-' with missing data NaN
data_wage = data_wage.replace('-',pd.np.nan)
#change the data type of 'Total'
data_wage['Total'] = data_wage['Total'].astype(float) 
#drop the rows with missing data
data_wage = data_wage.dropna()
#rename the column labels
data_wage = data_wage.rename(columns = {'2015':'number of workers','Total':'weekly salary'})
#see the first 5 rows of the table
data_wage.head(5)


Out[8]:
Occupation number of workers weekly salary
0 Total, full-time wage and salary workers 109080.0 809.0
2 Management, professional, and related occupations 44844.0 1158.0
3 Management, business, and financial operations... 18422.0 1258.0
4 Management occupations 12480.0 1351.0
5 Chief executives 1046.0 2041.0

In [9]:
#read education data from the file I have uploaded to the Internet, only read the sheet I need
data_education = pd.read_excel("https://github.com/yuxinXia96/final-project/raw/master/occupation.xlsx",sheetname=11)
#see the size of the table
data_education.shape


Out[9]:
(823, 9)

In [10]:
#some adjustments to get the cumulative percentage of each degree for every occupation
#basically add the pertage of workers with education equal or lower than a certain degree
data_education['%high school']=data_education['Unnamed: 2']+data_education['Unnamed: 3']
data_education['%bachelor']=data_education['Unnamed: 2']+data_education['Unnamed: 3']+data_education['Unnamed: 4']\
    +data_education['Unnamed: 5']+data_education['Unnamed: 6']
data_education['%master']=data_education['Unnamed: 2']+data_education['Unnamed: 3']+data_education['Unnamed: 4']\
    +data_education['Unnamed: 5']+data_education['Unnamed: 6']+data_education['Unnamed: 7']
data_education['%doctoral']=data_education['Unnamed: 2']+data_education['Unnamed: 3']+data_education['Unnamed: 4']\
    +data_education['Unnamed: 5']+data_education['Unnamed: 6']+data_education['Unnamed: 7']+data_education['Unnamed: 8']
data_education = data_education.drop(['Unnamed: 1','Unnamed: 2','Unnamed: 3','Unnamed: 4','Unnamed: 5'],axis=1)
data_education = data_education.drop(['Unnamed: 6','Unnamed: 7','Unnamed: 8'],axis=1)
#treat 0 percentage of workers in a certain occupation as missing value
data_education = data_education.replace(0,pd.np.nan)
#get rid of the rows with missing value
data_education = data_education.dropna()
#drop the first row which is not an observation but some descriptions
data_education = data_education.drop([0])
#rename the column lables
data_education = data_education.rename(columns=
{'Table 1.11 Educational attainment for workers 25 years and older by detailed occupation, 2014-15':'Occupation'})
#get rid of all the (1) in the occupation list
data_education = data_education.replace('\(1\)','', regex=True)
#see the top 5 rows of the table
data_education.head()


Out[10]:
Occupation %high school %bachelor %master %doctoral
1 Total, all occupations 32.9 86.1 95.8 100.0
2 Chief executives 10.7 71.2 92.7 100.1
3 General and operations managers 19.4 87.6 98.5 100.0
4 Legislators 10.7 71.2 92.7 100.1
5 Advertising and promotions managers 5.0 85.8 97.9 99.9

In [11]:
#merge two tables(weekly wage and education)
benefits = pd.merge(data_wage, data_education,   # left and right df's
                 how='left',        # add to left 
                 on='Occupation'       # link with this variable
                )

In [12]:
#calculate the change in weekly salary
benefits['%added bachelor']=benefits['%bachelor']-benefits['%high school']
benefits['%added master']=benefits['%master']-benefits['%bachelor']
benefits['%added doctoral']=benefits['%doctoral']-benefits['%master']
#calulate the change in annual salary
benefits['B benefit']=benefits['%added bachelor']*benefits['weekly salary']*52
benefits['M benefit']=benefits['%added master']*benefits['weekly salary']*52
benefits['D benefit']=benefits['%added doctoral']*benefits['weekly salary']*52
#drop rows with missing value
benefits = benefits.dropna()

In [13]:
#plot
salary1 = dict(type="scatter",                                      
           name="economic benefit of a bachelor degeree",                                      
           x=benefits['Occupation'],                             
           y=benefits["B benefit"],                             
           marker={"color": "yellow"},
           mode = 'markers',
          # size = benefits['number of workers']
          #remove ticks on x axis
            )
salary2 = dict(type="scatter",                                      
           name="economic benefit of a master degree",                                      
           x=benefits['Occupation'],                             
           y=benefits["M benefit"],                             
           marker={"color": "orange"},
           mode = 'markers',
          # size = benefits['number of workers']
          #remove ticks on x axis
            )
salary3 = dict(type="scatter",                                      
           name="economic benefit of a doctorial degree",                                      
           x=benefits['Occupation'],                             
           y=benefits["D benefit"],                             
           marker={"color": "red"},
           mode = 'markers',
          # size = benefits['number of workers']
          #remove ticks on x axis
            )
#marker = dict(size = benefits['number of workers'])
layout = dict(                      
              yaxis={"title": "economic benefits(in $)"},                   
              title="Economic Benefits for Different Oppupations",         
    )

iplot(go.Figure(data=[salary1,salary2,salary3],layout=layout))


From the plot, it shows that Bachelar degree is most helpful in increasing salaries. Master and doctor degrees are more helpful to people with jobs in finance, research and education field(left half of the graph) than blue collars(right half of the graph). One interest point is the highest red dot. A doctor degree has huge impact on lawyers.

Economic Benefits of Different Degrees to Business People

As a Stern student, I am particularly interested in the economic benefits of different degrees to people working in business and finance field. As a result, I took out some occupations with words related to business in their titles.


In [14]:
#get a new table with occupations related to business, finance, operation and accounting
b = benefits[benefits.Occupation.str.contains('Financial')|benefits.Occupation.str.contains('operation')|\
         benefits.Occupation.str.contains('business')|benefits.Occupation.str.contains('account')|\
        benefits.Occupation.str.contains('financial')]
#see the data
b


Out[14]:
Occupation number of workers weekly salary %high school %bachelor %master %doctoral %added bachelor %added master %added doctoral B benefit M benefit D benefit
5 General and operations managers 823.0 1260.0 19.4 87.6 98.5 100.0 68.2 10.9 1.5 4468464.0 714168.0 98280.0
11 Financial managers 1124.0 1408.0 11.1 77.8 97.5 100.1 66.7 19.7 2.6 4883507.2 1442355.2 190361.6
40 Business operations specialists, all other 186.0 1090.0 11.3 75.7 95.7 100.0 64.4 20.0 4.3 3650192.0 1133600.0 243724.0
42 Financial analysts 295.0 1426.0 2.3 63.2 96.2 100.0 60.9 33.0 3.8 4515856.8 2447016.0 281777.6
43 Personal financial advisors 407.0 1419.0 4.4 73.2 94.2 99.9 68.8 21.0 5.7 5076614.4 1549548.0 420591.6
48 Financial specialists, all other 66.0 1162.0 12.3 77.9 97.5 100.1 65.6 19.6 2.6 3963814.4 1184310.4 157102.4
203 Securities, commodities, and financial service... 211.0 1155.0 8.1 79.6 97.3 99.9 71.5 17.7 2.6 4294290.0 1063062.0 156156.0
211 Bill and account collectors 152.0 657.0 33.9 97.3 99.5 100.0 63.4 2.2 0.5 2165997.6 75160.8 17082.0
213 Bookkeeping, accounting, and auditing clerks 769.0 692.0 31.4 96.4 99.6 100.1 65.0 3.2 0.5 2338960.0 115148.8 17992.0
216 Financial clerks, all other 61.0 767.0 17.5 85.1 99.0 100.0 67.6 13.9 1.0 2696158.4 554387.6 39884.0

In [15]:
#plot
s1 = dict(type="bar",                                      
           name="economic benefit of bachelor degeree",                                      
           x=b['Occupation'],                             
           y=b["B benefit"],                             
           marker={"color": "yellow"}
            )
s2 = dict(type="bar",                                      
           name="economic benefit of master degree",                                      
           x=b['Occupation'],                             
           y=b["M benefit"],                             
           marker={"color": "orange"},
            )
s3 = dict(type="bar",                                      
           name="economic benefit of doctorial degree",                                      
           x=b['Occupation'],                             
           y=b["D benefit"],                             
           marker={"color": "red"},
            )
#marker = dict(size = benefits['number of workers'])
layout = dict(                      
              yaxis={"title": "economic benefits (in $)"},                   
              title="Economic Benefits for Oppupations in Business and Finance Field",         
    )

iplot(go.Figure(data=[s1,s2,s3],layout=layout))


The picture shows that bachelor degrees are of great help to business people. Master degrees are helpful to people in finance field, but not that helpful to people who have the work related to accounting. Doctor degrees are helpful to people who work as analysts and specialists, but add less value compared with bachelor or master degrees.

Cost of Getting a Degree: Tuition

After examing the economic benefits of getting a degree, we now focus on the cost paid to get a degree. The cost we consider here is tuition. Unlike the payment for future jobs, people usually know exactly how much they need to pay for a certain school they are accepted. As a result, I am only going to show a general picture of college tuitions without seperating colleges according to their locations or major focuses. The annual tuition here is just a rough estimation of the cost.

Data Source:
The website leads to an online database. The important key words I chose are: use final release data; select institutions: by group—U.S. only; select variables: tuition and fee (2015-16). After selecting these key words, the database will provide a csv file with the data needed. I uploaded the file to Github.


In [16]:
data_tuition = pd.read_csv('https://raw.githubusercontent.com/yuxinXia96/final-project/master/CSV_1262016-477.csv')

In [17]:
#drop the columns with year info and ID info
data_tuition = data_tuition.drop(['unitid','year'],axis=1)
#drop rows with missing data
data_tuition = data_tuition.dropna()
#rename columns
data_tuition = data_tuition.rename(columns = {'institution name':'institution',
                                              'DRVIC2015.Tuition and fees, 2015-16':'tuition'})
#calculate average tuition
data_tuition['mean'] = data_tuition['tuition'].mean()

In [18]:
#plot
t = dict(type="scatter",                                      
           name="tuition 2015-16",                                      
           x=data_tuition['institution'],                             
           y=data_tuition["tuition"],                             
           marker={"color": "pink"},
           mode = 'markers',
            )
m = dict(type="scatter",                                      
           name="mean",                                      
           x=data_tuition['institution'],                             
           y=data_tuition["mean"],                             
           marker={"color": "brown"},
           mode = 'lines',
            )
layout = dict(                      
              yaxis={"title": "tuition(in $)"},                   
              title="Tuition 2015-16",         
    )
iplot(go.Figure(data=[t,m],layout=layout))


From the picture, it shows that the average tuition is around 15K per year, and tuition is almost within the range from 5K to 50K.

Recall the pictures we did previously, the benefits of barchalor degrees are mostly above 1.5M. Thus, going to a college is worthwhile even without considering the gain of knowledge and improvement of social networks. Also, the annual salary increase doesn’t consider the lifelong benefits from education. The increases in earnings will last much longer than the school years. So the actual economic benefits will be bigger than what we can see from the graph.


In [ ]: